Abstract

We consider the problem of designing mod- 
els to leverage a recently introduced ap- 
proximate model averaging technique called 
dropout. We define a simple new model called 
maxout (so named because its output is the 
max of a set of inputs, and because it is a nat- 
ural companion to dropout) designed to both 
facilitate optimization by dropout and im- 
prove the accuracy of dropout's fast approxi- 
mate model averaging technique. We empir- 
ically verify that the model successfully ac- 
complishes both of these tasks. We use max- 
out and dropout to demonstrate state of the 
art classification performance on four bench- 
mark datasets: MNIST, CIFAR-10, CIFAR- 
100, and SVHN. 

1. Introduction 

A recently introduced technique known as dropout 
(Hinton et al., 2012) provides an inexpensive and sim- 
ple means of training a large ensemble of models that 
share parameters, as well as an inexpensive and sim- 
ple means of approximately averaging together these 
models to make a prediction. Dropout has been used 
to improve the performance of multilayer perceptrons 
and deep convolutional networks, redefining the state 
of the art on tasks ranging from audio classification to 
very large scale object recognition (Hinton et al., 2012; 
Krizhevsky et al., 2012). While dropout is known 
to work well in practice, it has not previously been 
demonstrated to actually perform model averaging for 
deep architectures. Moreover, dropout is generally
viewed as an indiscriminately applicable tool that will
reliably yield a modest improvement in performance
when applied to almost any model.

Code associated with this paper is available at
https://github.com/lisa-lab/pylearn2 in the module
pylearn2.models.maxout.

We argue that rather than using dropout as a slight 
performance enhancement applied to arbitrary mod- 
els, the best performance may be obtained by di- 
rectly considering how to use dropout as a model av- 
eraging technique, and designing a model that en- 
hances dropout's abilities in this respect. Training 
using dropout differs significantly from previous ap- 
proaches such as ordinary stochastic gradient descent. 
Dropout is most effective when taking relatively large 
steps in parameter space. In this regime, each up- 
date can be seen as making a significant update to a 
different model on a different subset of the training 
set. The ideal operating regime for dropout is when 
the overall training procedure resembles training an 
ensemble with bagging under parameter sharing con- 
straints. This differs radically from the ideal stochas- 
tic gradient operating regime in which a single model 
makes steady progress via small steps. Another impor- 
tant consideration is that dropout model averaging is 
only an approximation when applied to deep models. 
Models that are explicitly designed to minimize this 
approximation error may thus enhance dropout's per- 
formance as well. 

We propose a simple model that we call maxout that 
has beneficial characteristics both for optimization and 
model averaging with dropout. We use this model in 
conjunction with dropout to set the state of the art on 
four benchmark datasets. 

2. Review of dropout 

Dropout is a technique that can be applied to
deterministic feedforward architectures that predict an
output $y$ given input vector $v$. These architectures contain
a series of hidden layers $h = \{h^{(1)}, \ldots, h^{(L)}\}$. Dropout
trains an ensemble of models consisting of the set of all
models that contain a subset of the variables in both
$v$ and $h$. The same set of parameters $\theta$ is used to
parameterize a family of distributions $p(y \mid v; \theta, \mu)$, where
$\mu \in \mathcal{M}$ is a binary mask determining which variables
to include in the model. On each presentation of a
training example, we train a different sub-model by
following the gradient of $\log p(y \mid v; \theta, \mu)$ for a
different randomly sampled $\mu$. For many parameterizations
of $p$ (such as typical multilayer perceptrons) the
instantiation of different sub-models $p(y \mid v; \theta, \mu)$ can be
obtained by elementwise multiplication of $v$ and $h$ with
the mask $\mu$. This training procedure is similar to
bagging (Breiman, 1994), where many different models are
trained on different subsets of the data. Dropout
training differs from bagging in that each model is trained
for only one step and all of the models share
parameters. In order for this training procedure to behave as
if it is training an ensemble rather than a single model,
each update must have a large effect, so that it makes
the sub-model corresponding to that $\mu$ fit the current
input $v$ well.
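
As an illustration only, the masking step described above might be sketched as follows. The function names, the inclusion probability of 0.5, the rectifier hidden layer, and the nested-list weight layout are assumptions made for this sketch; they are not details given in the text.

```python
import random


def sample_mask(n, rng, keep_prob=0.5):
    """Draw a binary mask; each unit is kept with probability keep_prob (assumed 0.5)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]


def apply_mask(values, mask):
    """Elementwise multiplication of a vector with a binary mask."""
    return [x * m for x, m in zip(values, mask)]


def affine(x, W, b):
    """Compute x^T W + b, where W is a list of len(x) rows and b has one entry per output."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def submodel_forward(v, params, rng):
    """Run one randomly sampled sub-model: mask the input, then mask the hidden layer."""
    W1, b1, W2, b2 = params
    mu_v = sample_mask(len(v), rng)
    h = [max(0.0, z) for z in affine(apply_mask(v, mu_v), W1, b1)]
    mu_h = sample_mask(len(h), rng)
    logits = affine(apply_mask(h, mu_h), W2, b2)
    return logits, (mu_v, mu_h)
```

Each call to `submodel_forward` draws a fresh mask, so successive training examples are presented to different sub-models that all share the same parameters.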

The functional form becomes important when it comes 
time to make a prediction by averaging together all 
models. In a typical application of bagging, the pre- 
diction is given by the arithmetic mean of all models. 
It is not obvious how to take the arithmetic mean over 
exponentially many models. One of the key insights 
of the dropout technique is that some model families 
admit a simple and inexpensive means of computing 
the geometric mean. In the case where
$p(y \mid v; \theta) = \mathrm{softmax}(v^T W + b)$, the predictive distribution defined
by renormalizing the geometric mean of $p(y \mid v; \theta, \mu)$
over $\mathcal{M}$ is simply given by $\mathrm{softmax}(v^T W/2 + b)$. In
other words, the average over the predictions of ex- 
ponentially many models can be computed simply by 
running the full model with the weights divided by 2. 
This result holds exactly in the case of a single layer 
softmax model. Previous work on dropout applies the 
same scheme in deeper architectures such as multi- 
layer perceptrons and convolutional neural networks. 
For these deeper models, this method of prediction is 
only an approximation to the geometric mean. The 
approximation has not been characterized mathemat- 
ically, but performs well in practice. 
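
The single-layer result can be checked numerically on a small example. The sketch below enumerates every input mask for a softmax model, renormalizes the geometric mean of the resulting predictions, and compares it with the prediction of the full model with halved weights. The helper names and the example values are hypothetical, and every mask is assumed to be equally likely.

```python
import itertools
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def logits(v, W, b, scale=1.0):
    """Compute scale * (v^T W) + b; the bias is not scaled."""
    return [scale * sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
            for j in range(len(b))]


def geometric_mean_prediction(v, W, b):
    """Renormalized geometric mean of the predictions of all 2^d masked models."""
    n_classes = len(b)
    log_sum = [0.0] * n_classes
    masks = list(itertools.product([0, 1], repeat=len(v)))
    for mu in masks:
        masked = [x * m for x, m in zip(v, mu)]
        p = softmax(logits(masked, W, b))
        for j in range(n_classes):
            log_sum[j] += math.log(p[j])
    geo = [math.exp(s / len(masks)) for s in log_sum]
    total = sum(geo)
    return [g / total for g in geo]


v = [1.0, -2.0, 0.5]
W = [[0.2, -0.4], [1.5, 0.3], [-0.7, 0.9]]
b = [0.1, -0.2]
exact = geometric_mean_prediction(v, W, b)
fast = softmax(logits(v, W, b, scale=0.5))
```

Here `exact` and `fast` agree up to floating point error, because the average of the masked logits over all masks equals the logits computed with halved weights, and the renormalization removes the remaining constant.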

3. Description of maxout 

The maxout model is simply a feed-forward
architecture, such as a multilayer perceptron or deep
convolutional neural network, that uses a new type of
activation function: the maxout unit. Given an input
$x \in \mathbb{R}^d$, a maxout hidden layer implements the function

$$h_i(x) = \max_{j \in [1,k]} z_{ij}$$

where

$$z_{ij} = x^T W_{\cdots ij} + b_{ij}$$

for learned parameters $W \in \mathbb{R}^{d \times m \times k}$ and $b \in \mathbb{R}^{m \times k}$.
In the context of convolutional networks, a maxout 
feature map can be constructed by taking the max- 
imum across k affine feature maps (i.e., pool across 
channels, rather than over spatial locations). A single 
maxout unit can be interpreted as making a piecewise 
linear approximation to an arbitrary convex function. 
In other words, as the training algorithm optimizes the 
parameters, it learns not just the relationship between 
hidden units, but also the activation function of each 
hidden unit. See Fig. 1 for a graphical depiction of 
how this works. 
Figure 1. Graphical depiction of how the maxout
activation function can implement the rectified linear and
absolute value rectifier activation functions, and approximate
the quadratic activation function. This diagram is two-dimensional and as such
shows only how maxout behaves with a single input di- 
mension, but in multiple dimensions a maxout unit can 
approximate arbitrary convex functions. 
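
A maxout layer as defined above, together with the one-dimensional special cases named in Figure 1, might be sketched as follows. The nested-list layout (`W[i][j]` holds the weight vector of filter `j` in unit `i`) is an assumption chosen for readability, and the units below use different numbers of pieces purely for illustration.

```python
def maxout_layer(x, W, b):
    """Return h_i(x) = max_j (x . W[i][j] + b[i][j]) for every unit i."""
    outputs = []
    for unit_filters, unit_biases in zip(W, b):
        z = [sum(xi * wi for xi, wi in zip(x, w)) + bias
             for w, bias in zip(unit_filters, unit_biases)]
        outputs.append(max(z))
    return outputs


# Tangent lines to x**2 at these points give a piecewise linear lower bound.
tangent_points = [-2.0, -1.0, 0.0, 1.0, 2.0]

W = [
    [[1.0], [0.0]],                       # rectifier: max(x, 0)
    [[1.0], [-1.0]],                      # absolute value: max(x, -x)
    [[2.0 * a] for a in tangent_points],  # approximate quadratic
]
b = [
    [0.0, 0.0],
    [0.0, 0.0],
    [-(a * a) for a in tangent_points],
]

print(maxout_layer([-2.0], W, b))
print(maxout_layer([1.5], W, b))
```

The third unit matches x**2 exactly at the tangent points and underestimates it in between; adding more pieces tightens the approximation.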

Maxout abandons many of the mainstays of traditional
activation function design. The representation it
produces is not sparse at all (see Fig. 2), though the
gradient is highly sparse and dropout will artificially
sparsify the effective representation during training. While
maxout may learn to saturate on one side or the other,
this is a measure zero event. While a significant
proportion of parameter space corresponds to the function
being bounded from below, maxout is not constrained
to learn to be bounded at all. In fact, being bounded
from above is also a measure zero event. Maxout is
locally linear almost everywhere, while many popular
activation functions have significant curvature. Given
all of these departures from standard practice, it may
seem surprising that maxout activation functions work
at all, but we find that they are very robust and easy to
train with dropout, and achieve excellent performance.

[Plot: Distribution of maxout responses]

Figure 2. The activations of maxout units are not sparse.
However, the gradient is sparse because the max operator
guarantees that exactly one input filter has nonzero
gradient, except on a set of measure zero where some filter
responses are equal.



4. Maxout is a universal approximator 

A standard MLP with enough hidden units is a uni- 
versal approximator. Surprisingly, the maxout net- 
work requires only two maxout hidden units to be a 
universal approximator. The key is that each hidden 
unit may require arbitrarily many affine components. 
In particular, we show below that a maxout model
with just two hidden units can approximate, arbitrarily
closely, any continuous function of $x \in \mathbb{R}^d$. A
diagram illustrating the basic idea of the proof is
presented in Fig. 3.

Consider the continuous piecewise linear (PWL)
function $g(x)$ consisting of $k$ locally linear (affine) regions
on $\mathbb{R}^d$.

Proposition 4.1 (From Theorem 2.1 in (Wang,
2004)) For any positive integers $m$ and $d$, there always
exist two groups of $(d+1)$-dimensional real-valued
parameter vectors $[W_{1j}, b_{1j}]$, $j \in [1,k]$ and
$[W_{2j}, b_{2j}]$, $j \in [1,k]$ such that:

$$g(x) = h_1(x) - h_2(x) \qquad (1)$$

That is, any continuous PWL function can be ex- 
pressed as a difference of two convex PWL functions. 
The proof is given in (Wang, 2004) and omitted here
for brevity. 

Proposition 4.2 From the Stone-Weierstrass
approximation theorem, let $f : C \to \mathbb{R}$ be a continuous
function and $\epsilon > 0$ be any positive real number. Then
there exists a continuous PWL function $g$ (depending upon
$\epsilon$), such that for all $x \in C$, $|f(x) - g(x)| < \epsilon$.

Figure 3. An MLP containing two maxout units can
approximate any continuous function arbitrarily well. The weights
in the final layer can set $g$ to be the difference of $h_1$ and
$h_2$. Provided that $z_1$ and $z_2$ are allowed to have arbitrarily
high cardinality, $h_1$ and $h_2$ can approximate any convex
function. $g$ can thus approximate any continuous function
due to being a difference of approximations of arbitrary
convex functions.

Theorem 4.3 Universal approximator theorem. Any
continuous function $f$ can be approximated arbitrarily
well on a compact domain $C \subset \mathbb{R}^d$ by a maxout
network with two maxout hidden units.

Sketch of Proof By Proposition 4.2, any continuous
function can be approximated arbitrarily well (up to
$\epsilon$) by a piecewise linear function. We now note that
the representation of piecewise linear functions given
in Proposition 4.1 exactly matches a maxout net with
two hidden units $h_1(x)$ and $h_2(x)$, with sufficiently
large $k$ to achieve the desired degree of approximation
$\epsilon$. Combining these, we conclude that a two hidden
unit maxout network can approximate any continuous
function $f(x)$ arbitrarily well on the compact domain
$C$. In general as $\epsilon \to 0$, we have $k \to \infty$.
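
A small concrete instance of the difference-of-convex construction may help. The sketch below is a hand-picked one-dimensional example, not taken from the proof: it writes the clipping function, which is neither convex nor concave, as the difference of two maxout units with k = 2 pieces each. The names `maxout_unit`, `H1`, `H2`, and `g` are chosen for this illustration.

```python
def maxout_unit(x, filters):
    """Max over affine pieces (w, b) of w * x + b, for scalar x."""
    return max(w * x + b for w, b in filters)


H1 = [(1.0, 0.0), (0.0, 0.0)]   # h_1(x) = max(x, 0), convex
H2 = [(1.0, -1.0), (0.0, 0.0)]  # h_2(x) = max(x - 1, 0), convex


def g(x):
    """g(x) = h_1(x) - h_2(x), which clips x to the interval [0, 1]."""
    return maxout_unit(x, H1) - maxout_unit(x, H2)


for x in (-1.0, 0.25, 3.0):
    print(x, g(x))
```

The final layer only needs the weights +1 and -1 on the two units; all of the shape of `g` is carried by the affine pieces inside each maxout unit.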

5. Benchmark results 

We evaluated the maxout model on four benchmark 
datasets and set the state of the art on all of them. 
Maxout shows not only a clear improvement over pre- 
vious methods but also shows a strong enhancement 
of dropout's abilities. 

5.1. MNIST 

The MNIST (LeCun et al., 1998) dataset consists of
28 x 28 pixel greyscale images of handwritten digits
0-9, with 60,000 training examples and 10,000 test
examples.

Table 1. Test set misclassification rates for the best
methods on the permutation invariant MNIST dataset. Only
three methods outperform the maxout MLP, and all of
these rely on unsupervised pretraining.

Method                                            Test error
Rectifier MLP + dropout (Hinton et al., 2012)     1.10%
DBM (Salakhutdinov & Hinton, 2009)                0.95%
Maxout MLP + dropout                              0.94%
Deep Convex Network (Yu & Deng, 2011)             0.83%
Manifold Tangent Classifier (Rifai et al., 2011)  0.81%
DBM + dropout (Hinton et al., 2012)               0.79%

Traditionally, methods are evaluated in separate
categories depending on whether or not they exploit the fact that
the examples have image structure. For the
permutation invariant version of the MNIST task, only
methods unaware of the 2D structure of the data are
permitted. In this case, we trained a model consisting
of two densely connected maxout layers followed
by a softmax layer. Besides using dropout, we fur- 
ther regularized the model by imposing a constraint 
on the norm of each weight vector. All constraint val- 
ues, learning rate and momentum schedule parame- 
ters, layer sizes, etc. were selected by minimizing the 
error on a validation set consisting of the last 10,000 
training examples. In order to make use of the full 
training set, we recorded the value of the log likelihood 
on the first 50,000 examples at the point of minimal 
validation error. We then continued training on the 
full 60,000 example training set until the validation 
set log likelihood matched this number. We obtained 
a test set error of 0.94%, which is the best result of 
which we are aware that does not use unsupervised 
pretraining. We summarize the state of the art results 
on permutation invariant MNIST in Table 1. 
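
The text does not spell out how the norm constraint is enforced. One common way to impose such a constraint, shown here purely as an assumed sketch, is to rescale a weight vector back onto the allowed ball after each update whenever its Euclidean norm exceeds a limit. The function name and the limit value used below are hypothetical.

```python
import math


def project_max_norm(w, max_norm):
    """Rescale w so that its L2 norm does not exceed max_norm (assumed scheme)."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= max_norm:
        return list(w)  # already satisfies the constraint
    scale = max_norm / norm
    return [x * scale for x in w]


# Hypothetical limit of 2.5; the paper selects constraint values on a validation set.
print(project_max_norm([3.0, 4.0], 2.5))
print(project_max_norm([0.3, 0.4], 2.5))
```

A vector inside the ball is returned unchanged, so the constraint only acts on weight vectors that have grown too large.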

We also considered the MNIST dataset without the 
permutation invariance restriction. In this case, we 
used three convolutional maxout hidden layers (with 
spatial max pooling on top of the maxout layers) fol- 
lowed by a densely connected softmax layer. We were 
able to rapidly explore parameter space thanks to the 
extremely fast GPU convolution library developed by 
Krizhevsky et al. (2012). We obtained a test set error 
rate of 0.45%, which sets a new state of the art in this
category. Note that it is possible to get better results
on MNIST by augmenting the dataset with
transformations of the standard set of images (Ciresan et al.,
2010). A summary of the best methods on the general
MNIST dataset is provided in Table 2.

Figure 4. Example filters learned by a maxout MLP
trained with dropout on MNIST. Each row contains the
filters whose responses are pooled to form a single maxout
unit.

Table 2. Test set misclassification rates for the best
methods on the general MNIST dataset, excluding methods that
augment the training data.

Method                                            Test error
2-layer CNN + 2-layer NN (Jarrett et al., 2009)   0.53%
Stochastic pooling (Zeiler & Fergus, 2013)        0.47%
Conv. maxout + dropout                            0.45%

5.2. CIFAR-10 

The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) 
consists of 32 x 32 color images drawn from 10 classes. 
The training set contains 50,000 images and the test 
set contains 10,000. We preprocessed the data us- 
ing global contrast normalization and ZCA whiten- 
ing. This is the same preprocessing applied by (Coates 
et al., 2011) to individual patches of the dataset in the 
context of unsupervised modeling of patches. 

We follow a similar procedure as with the MNIST
dataset, with one change. On MNIST, we find the
best number of training epochs in terms of validation
set error, then record the training set log likelihood
and continue training using the entire training set
until the validation set log likelihood has reached this
value. On CIFAR-10, continuing training in this
fashion is infeasible because the final value of the
learning rate is very small and the validation set error is
very high. Training until the validation set likelihood
matches the cross-validated value of the training
likelihood would thus take prohibitively long. Instead,
we retrain the model from scratch, and stop when the
new validation set likelihood matches the value of the
training set likelihood selected by cross validation. As
with MNIST, we do not use data augmentation (such
as training on translated and reflected versions of the
images) and compare our results only to methods that
do not use data augmentation.

Table 3. Test set misclassification rates for the best
methods (excluding data augmentation) on the CIFAR-10
dataset.

Method                                            Test error
Stochastic pooling (Zeiler & Fergus, 2013)        15.13%
CNN + Spearmint (Snoek et al., 2012)              14.98%
Conv. maxout + dropout                            12.93%

Our best model consists of three convolutional max- 
out layers followed by a fully connected maxout layer, 
then finally a softmax layer. This is similar to the
architecture used by Hinton et al. (2012) except that
our penultimate layer is fully connected instead of
locally connected. Using this approach we obtain a test
set error of 12.93%, which improves upon the state of 
the art by over two percentage points. (If we do not 
train on the validation set, we obtain a test set error of 
14.05%, which also improves over the previous state of 
the art). A summary of the best CIFAR-10 methods is 
provided in Table 3. A visualization of the convolution 
kernels is shown in Fig. 5. 

As shown in Fig. 6, dropout was critical for obtain- 
ing good generalization error. Unlike previous results 
in which dropout reduces the generalization error by 
about 10%, maxout is specifically designed to enhance 
the effect of dropout, resulting here in a greater than 
25% reduction in the generalization error. 

5.3. CIFAR-100 

The CIFAR-100 (Krizhevsky & Hinton, 2009) dataset
is the same size and format as the CIFAR-10 dataset,
but contains 100 classes, with only one tenth as many
labeled examples per class. Due to lack of time we
did not cross-validate hyperparameters on CIFAR-100
but simply applied the hyperparameters that yielded
the best validation set performance on CIFAR-10. We
obtained a test set error of 38.57%, which is state of
the art (if we do not retrain using the entire training
set, we obtain a test set error of 41.48%, which also
surpasses the current state of the art). A summary of
the best methods on CIFAR-100 is provided in Table
4.

Figure 6. When training a maxout network, the
improvement in validation set error that results from using dropout
is dramatic. Here we find a greater than 25% reduction in
our validation set error on CIFAR-10.

Table 4. Test set misclassification rates for the best
methods on the CIFAR-100 dataset.

Method                                            Test error
Receptive field learning (Jia & Huang, 2011)      45.77%
Stochastic pooling (Zeiler & Fergus, 2013)        42.51%
Conv. maxout + dropout                            38.57%

5.4. Street View House Numbers 

The SVHN (Netzer et al., 2011) dataset consists of
color images of house numbers collected by Google
Street View. The dataset comes in two formats. We
consider the second format, in which each image is of
size 32 x 32 and the task is to classify the digit in the
center of the image. Additional digits may appear
beside it but must be ignored. This is a difficult unsolved
real-world task with potential commercial applications
for systems that achieve under 1% error. There are
73,257 digits in the training set, 26,032 digits in the
test set and 531,131 additional, somewhat less
difficult examples, to use as an extra training set.
Following Sermanet et al. (2012b), to build a validation
set, we select 400 samples per class from the training
set and 200 samples per class from the extra set. The
remaining digits of the train and extra sets are used
for training.

Figure 5. Convolution kernels learned in the first layer of our CIFAR-10 network with $k = 2$. Each pair of filters appearing
in a column together drive the same maxout convolution channel.

Table 5. Test set misclassification rates for the best
methods on the SVHN dataset.

Method                                            Test error
(Sermanet et al., 2012a)                          4.90%
Stochastic pooling (Zeiler & Fergus, 2013)        2.80%
Conv. maxout + dropout                            2.72%

For this dataset, we did not train on the validation set 
at all. We used it only to find the best hyperparam- 
eters. We preprocessed the data in the same way as 
(Zeiler & Fergus, 2013), by applying local contrast nor- 
malization on each of the RGB channels. Otherwise, 
we followed the same approach as on MNIST. Our best 
model consists of three convolutional maxout hidden 
layers (with spatial pooling on top of maxout layers as 
for MNIST) followed by a densely connected softmax 
layer. We used 128, 128 and 256 affine feature maps 
max-pooled in groups of 2, 2 and 4, respectively. The 
spatial pooling shapes were respectively (4, 4), (4, 4), 
and (2, 2) with a stride of 2 in all cases. We obtained a 
test set error rate of 2.72%, which sets the state of the 
art. A summary of comparable methods is provided 
in Table 5. 
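
To make the channel arithmetic concrete: pooling 128, 128 and 256 affine feature maps in groups of 2, 2 and 4 leaves 64 maxout feature maps in each layer. The sketch below computes these counts and shows cross-channel pooling at a single spatial location. Grouping consecutive channels together is an assumption made for illustration, and the function name is hypothetical.

```python
def maxout_across_channels(features, k):
    """Max over consecutive groups of k channel values at one spatial location."""
    if len(features) % k != 0:
        raise ValueError("number of channels must be divisible by k")
    return [max(features[i:i + k]) for i in range(0, len(features), k)]


affine_maps = [128, 128, 256]  # affine feature maps per layer (from the text)
group_sizes = [2, 2, 4]        # maxout group size k per layer (from the text)
maxout_maps = [c // k for c, k in zip(affine_maps, group_sizes)]
print(maxout_maps)

# Four channel responses at one pixel, pooled in groups of 2.
print(maxout_across_channels([1.0, 5.0, -2.0, 0.0], 2))
```

This pooling acts across channels only; the spatial max pooling mentioned in the text is a separate step applied afterwards.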

6. Model Averaging 

Having demonstrated that maxout networks are effec- 
tive learning algorithms, we turn to analyzing the rea- 
sons for their success. We first identify reasons that 
maxout networks are highly compatible with dropout's 
approximate model averaging technique. 

The intuitive justification for averaging together
dropout models by dividing the weights by 2 given
by (Hinton et al., 2012) is that this does exact model
averaging for a single layer model, i.e. softmax
regression. To this characterization, we add the observation
that the model averaging remains exact if the model
is extended to multiple linear layers. While this has
the same representational power as a single layer, the
expression of the weights as a product of several
matrices could have a different inductive bias. More
importantly, it indicates that dropout does exact model
averaging in deeper architectures provided that they
are locally linear among the space of inputs to each
layer that are visited by applying different dropout
masks.

Figure 7. Sampling several models and taking the
geometric mean of their predictions approaches the error rate of
the prediction made by dividing the weights by 2.
However, the divided weights still obtain the best test error,
suggesting that dropout is a good approximation to
averaging over a very large number of models. Also, note that
the correspondence is more clear in the case of maxout.
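
The claim that averaging stays exact for multiple linear layers can be checked by brute force on a tiny network. The sketch below, under the assumption that input and hidden units are dropped independently with probability 0.5, enumerates all masks of a two-layer linear network followed by a softmax and compares the renormalized geometric mean with the prediction obtained by halving both weight matrices. All names and numeric values are hypothetical.

```python
import itertools
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def affine(x, W, b):
    """Compute x^T W + b for W given as a list of len(x) rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def halve(W):
    return [[w / 2.0 for w in row] for row in W]


def averaged_by_enumeration(v, W1, b1, W2, b2):
    """Renormalized geometric mean over every (input mask, hidden mask) pair."""
    log_sum = [0.0] * len(b2)
    count = 0
    for mu_v in itertools.product([0, 1], repeat=len(v)):
        for mu_h in itertools.product([0, 1], repeat=len(b1)):
            h = affine([x * m for x, m in zip(v, mu_v)], W1, b1)
            p = softmax(affine([x * m for x, m in zip(h, mu_h)], W2, b2))
            for j in range(len(b2)):
                log_sum[j] += math.log(p[j])
            count += 1
    geo = [math.exp(s / count) for s in log_sum]
    total = sum(geo)
    return [g / total for g in geo]


def averaged_by_halving(v, W1, b1, W2, b2):
    """Full model with the weights (not the biases) of each layer divided by 2."""
    h = affine(v, halve(W1), b1)
    return softmax(affine(h, halve(W2), b2))
```

Replacing the linear hidden layer with a nonlinearity that has curvature breaks this equality, which is the motivation for preferring units that are locally linear over the inputs visited by different masks.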

We argue that the ensemble style training used in
dropout encourages maxout units to be locally linear.
Because each subset of the model (corresponding to
one model in the ensemble) must make a good
prediction of the output, each unit should learn to have
roughly the same activation regardless of which of its
inputs are dropped out. Thus, while a maxout
network with arbitrary parameters will be far from locally
linear in this space, a maxout network trained with
dropout may have the identity of the maximal filter in
each unit change relatively rarely as the dropout masks
change. Thus networks consisting of linear operations
and the max(·) could learn to exploit dropout's
approximate model averaging technique well.

[Plot: Model Averaging on MNIST; axis label: Samples]

Figure 8. The KL divergence between the distribution
predicted using the dropout technique of dividing the weights
by 2 and the distribution obtained by taking the
geometric mean of the predictions of several sampled models
decreases as the number of samples increases. This suggests
that dropout does indeed do model averaging, even for deep
networks. The approximation is more accurate for maxout
units than for tanh units.

Many popular activation functions have significant 
curvature nearly everywhere. These observations sug- 
gest that the approximate model averaging of dropout 
will not be as accurate for networks incorporating such 
activation functions. To test this, we compared the 
best maxout model trained on MNIST with dropout 
to a hyperbolic tangent network trained on MNIST 
with dropout. We sampled several subsets of each 
model and compared the geometric mean of these sam- 
pled models' predictions to the prediction made using 
the dropout technique of dividing the weights by 2. 
We found evidence that dropout is indeed performing 
model averaging, even in multilayer networks, and that 
it is more accurate in the case of maxout networks. See 
Fig. 7 and Fig. 8 for details. 
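
The comparison described above needs two ingredients: the renormalized geometric mean of several sampled predictions, and a divergence between that mean and the weight-scaled prediction. A minimal sketch of both follows. The direction of the KL divergence is not stated in the text, so the argument order here is an assumption, as are the function names; all probabilities are assumed to be strictly positive.

```python
import math


def geometric_mean(distributions):
    """Renormalized geometric mean of a list of discrete distributions."""
    n = len(distributions)
    n_classes = len(distributions[0])
    geo = [math.exp(sum(math.log(p[j]) for p in distributions) / n)
           for j in range(n_classes)]
    total = sum(geo)
    return [g / total for g in geo]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical predictions from two sampled sub-models and from the scaled model.
sampled = [[0.9, 0.1], [0.1, 0.9]]
scaled_prediction = [0.6, 0.4]
print(kl_divergence(scaled_prediction, geometric_mean(sampled)))
```

In an experiment of the kind described, one would track this divergence as more sub-models are added to `sampled`.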

7. Optimization 

The second key reason that maxout performs well is 
that it improves the bagging style training phase of 
the dropout algorithm. 



Note that the arguments in section 6 motivating the
use of maxout also apply equally to rectified linear
units (Salinas & Abbott, 1996; Hahnloser, 1998;
Glorot et al., 2011). Maxout seems superficially similar
to max pooling over a set of rectified linear units,
which is equivalent to including a constant 0 in the
set from which maxout selects the max. However, we
find that including this constant 0 is very harmful to
optimization in the context of dropout. For instance,
on MNIST our best validation set error with an MLP
is 1.04%. If we include a 0 in the max, this rises to
over 1.2%. In the context of dropout, we argue that
maxout has superior optimization properties relative
to max pooling over rectified linear units.

7.1. Optimization experiments 

To verify that maxout yields better optimization per- 
formance than max pooled rectified linear units when 
training with dropout, we carried out two experiments. 
First, we stressed the optimization capabilities of the 
training algorithm by training a small (two hidden 
convolutional layers with k = 2 and sixteen kernels) 
model on the large (600,000 example) SVHN dataset. 
When training with rectifier units the training error 
gets stuck at 7.3%. If we train instead with maxout 
units, we obtain 5.1% training error. As another op- 
timization stress test, we tried training very deep and 
narrow models on MNIST, and found that maxout net- 
works cope better with increasing depth than rectifiers. 
See Fig. 9 for details. 



[Plot: MNIST error for different numbers of layers]




Figure 9. We trained a series of models with increasing 
depth on MNIST. Each layer contains only 80 units (k=5 
for 400 filters) to make it difficult to fit the training set. 
Maxout optimization degrades gracefully with depth but 
rectifier units worsen noticeably at 6 layers and dramati- 
cally at 7. 



Maxout Networks 



7.2. Saturation 

Optimization proceeds very differently when using 
dropout than when using ordinary stochastic gradi- 
ent descent. SGD usually works best with a small 
learning rate that results in a smoothly decreasing ob- 
jective function, while dropout works best with a large 
learning rate, resulting in a constantly fluctuating ob- 
jective function. Dropout rapidly explores many dif- 
ferent directions and rejects the ones that worsen per- 
formance, while SGD moves slowly and steadily in the 
most promising direction. We find empirically that 
these very different operating regimes result in very 
different outcomes for rectifier units. When training
with SGD, we find that the rectifier units saturate at
0 less than 5% of the time. When training with dropout,
this increases to 60% of the time. Because the 0 in the
max(0, z) activation function is a constant, and not
a parameter as when a maxout unit is 0, this blocks
the gradient from flowing through the unit. In the
absence of gradient through the unit, it is difficult for
training to change this unit to become active again. 
Maxout does not suffer from this problem because gra- 
dient always flows through every maxout unit. Units 
that take on negative activations may be steered to 
become positive again later. Fig. 10 illustrates how 
active rectifier units become inactive at a greater rate 
than inactive units become active when training with 
dropout, but maxout units, which are always active, 
transition between positive and negative activations at 
about equal rates in each direction. We hypothesize 
that the high proportion of zeros and the difficulty of 
escaping them impairs the optimization performance 
of rectifiers relative to maxout. 
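
The difference in gradient flow can be seen directly from the derivatives of the two activation functions. The sketch below uses hypothetical helper names: a rectifier passes no gradient when its input is negative, whereas a maxout unit always routes the full gradient to whichever filter is currently maximal, even when every filter response is negative. Ties are broken by taking the first maximal filter, which is an arbitrary choice on a set of measure zero.

```python
def rectifier_grad(z):
    """Derivative of max(0, z) with respect to z (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0


def maxout_grad(zs):
    """Derivative of max_j zs[j] with respect to each zs[j]: one-hot at the argmax."""
    j_star = max(range(len(zs)), key=lambda j: zs[j])
    return [1.0 if j == j_star else 0.0 for j in range(len(zs))]


print(rectifier_grad(-3.0))        # saturated rectifier: no gradient
print(maxout_grad([-3.0, -1.0]))   # negative maxout output: gradient still flows
```

The one-hot pattern returned by `maxout_grad` is also why the gradient of a maxout unit is sparse even though its activation is not.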

In order to investigate this hypothesis, we trained two 
MLPs on MNIST with the same architecture of 1200 
filters per layer pooled in groups of 5. When we include
a constant 0 in the max pooling, the resulting
trained model fails to make use of 17.6% of the filters 
in the first layer and 39.2% of the filters in the second
layer. A small minority of the filters usually took
on the maximal value in the pool, and the rest of the 
time the maximal value was a constant 0. Maxout, on 
the other hand, used all but 2 of the 2400 filters in 
the network. Each filter in each maxout unit in the 
network was maximal for some training example. All 
filters had been utilised and tuned to the classification 
task. 

7.3. Lower layer gradients and the bagging effect

In order to behave differently from SGD, dropout
requires that the gradient change noticeably when the
choice of which units to drop changes. If the
gradient is approximately constant with respect to the
dropout mask, then dropout simplifies to SGD
training. We tested the hypothesis that rectifier networks
suffer from diminished gradient flow to the lower
layers of the network by monitoring the variance with
respect to dropout masks for fixed data during
training of two different MLPs on MNIST. The variance
on the gradient of the output weights was about 1.4
times larger for maxout on an average training epoch
step, while the variance on the gradient of the first
layer weights was 3.4 times larger for maxout than
for rectifiers. In concordance with our previous result
showing that maxout with dropout allows training
deeper networks, this greater variance suggests that
maxout better propagates varying information
downward to the lower layers and helps dropout training to
better resemble bagging for these lower-layer
parameters. Rectifier networks, with more of their gradient
lost to saturation, presumably cause dropout training
to behave more like regular SGD toward the bottom
of the network.

Figure 10. During dropout training, rectifier units
transition from positive to 0 activation more frequently than they
make the opposite transition, resulting in a preponderance of
0 activations. Maxout units freely move between positive
and negative signs, moving in each direction at roughly
equal rates.
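
The quantity monitored in this experiment, the variance of a gradient with respect to the dropout mask for fixed data, can be illustrated on a toy model. The sketch below is not the measurement used in the paper: it assumes a single masked linear unit with squared error, enumerates all masks, and reports the population variance averaged over parameters. All names and values are hypothetical.

```python
import itertools


def gradient_variance(gradients):
    """Mean over parameters of the population variance across the given gradients."""
    n = len(gradients)
    n_params = len(gradients[0])
    total = 0.0
    for p in range(n_params):
        values = [g[p] for g in gradients]
        mean = sum(values) / n
        total += sum((x - mean) ** 2 for x in values) / n
    return total / n_params


def masked_linear_gradient(v, w, target, mask):
    """Gradient of 0.5 * (y - target)**2 w.r.t. w, where y = sum_i v_i * mask_i * w_i."""
    y = sum(vi * mi * wi for vi, mi, wi in zip(v, mask, w))
    return [(y - target) * vi * mi for vi, mi in zip(v, mask)]


v, w, target = [1.0, 2.0], [0.5, -0.5], 1.0
grads = [masked_linear_gradient(v, w, target, mask)
         for mask in itertools.product([0, 1], repeat=len(v))]
print(gradient_variance(grads))
```

If every mask produced the same gradient, `gradient_variance` would return 0, which corresponds to the case in which dropout training reduces to ordinary SGD.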

8. Conclusion 

In this paper, we have proposed a new family of func- 
tions called maxout that is particularly well suited for 
training with dropout, and for which we have proven 
a universal approximation theorem. We have shown 
empirical evidence that dropout attains a good ap- 
proximation to model averaging in deep models. We 
have shown that maxout exploits this model
averaging behavior because the approximation is more
accurate for maxout units than for tanh units. We have
demonstrated that optimization behaves very
differently in the context of dropout than in the pure SGD
case. By designing the maxout gradient to avoid
pitfalls such as failing to use many of a model's filters,
we are able to train maxout networks on much larger
training sets and with much deeper networks than is
possible using rectifier units. We have also shown that
maxout propagates variations in the gradient due to
different choices of dropout masks to the lowest
layers of a network, thereby ensuring that every
parameter in the model can enjoy the full benefit of dropout
rather than SGD training and more faithfully emulate
bagging training. More broadly, the state of the art
performance of our approach on five different
benchmark tasks motivates the design of further models that
are explicitly intended to perform well when combined
with inexpensive approximations to model averaging.